1 Introduction

R has a steep learning curve. It is often useful to cannibalize and modify someone else’s code rather than starting from scratch. I hope this page offers just such a resource. If you have a plot that you are particularly proud of, please feel free to send me the code. I will post it here and acknowledge you as author.

Often half the battle of R is reading the data in. If you plan to share a code snippet, consider posting your data to a public repository. You can send a link, and R will read the data in directly, which obviates the need for each user to set a unique working directory. The ‘read_csv’ command from the readr library works pretty well. You can also link to Google Drive or a similar public URL, but GitHub seems to work the smoothest.

Here’s sample code that successfully reads data in from a CSV file posted on my lab’s GitHub repository.

url.dat <- read_csv("https://raw.githubusercontent.com/reilly-lab/reilly-lab.github.io/master/BoyGirl.csv", 
    col_names = TRUE)

Here are some libraries we will be using.

library(reshape2)  #melt and cast
library(tidyverse)  #ggplot2 and dplyr
library(readr)  #read in URLs
library(formatR)
library(knitr)  #knits this RMarkdown to HTML or whatever the hell else you want it to
library(gplots)
library(corrplot)  #correlation matrices and correlograms
library(tibble)  #clean little tibbles
library(RColorBrewer)  #creates custom color palettes
library(splitstackshape)  #generates ID variables, crazy useful little tool.
library(ggdendro)  #dendrograms
library(TTR)  #smoothing and simple moving averages for time series
library(ggthemes)  #tufte, economist, etc.
library(psych)  #describe_by
library(igraph)  #used here for converting hclust to igraph object (vector(s) of edges)
library(ggraph)  #plots cluster dendrogram as graph/network
library(dendextend)  #the mighty dendrogram
library(ggrepel)  #jitters points and labels a wee bit
library(DescTools)  #reorders factors
library(readxl)

2 Aesthetics, Themes, Graphical Parameters

2.1 My custom theme for ggplot2

Here’s a minimalist home brew of a theme for ggplot2. I’ll add this to most of the plots to follow. It strips panel gridlines and all sorts of other default junk.

jamie.theme <- theme_bw() + theme(axis.line = element_line(colour = "black"), 
    panel.grid.minor = element_blank(), panel.grid.major = element_blank(), 
    panel.border = element_blank(), panel.background = element_blank(), legend.title = element_blank())


2.2 Axis adjustments (limits, breaks, etc.)

ggplot2 can be a bit challenging regarding axes. Let’s first generate a dataframe populated with data randomly sampled from 0-100 without replacement. Then plot it.

set.seed(999)  #fixed random sampling
dat <- data.frame(replicate(2, sample(0:100, 100, rep = F)))  #selection without replacement
baseplot <- ggplot(dat, aes(X1, X2)) + geom_point(shape = 21, fill = "blue", 
    size = 2.3, alpha = 0.6) + jamie.theme + ylab(NULL) + xlab(NULL)
print(baseplot)

2.2.1 Specify user-defined axis breaks.

Yuck. We need finer-grained notation on both axes. Add breaks every 10.

newplot <- baseplot + scale_x_continuous(breaks = seq(0, 100, 10), limits = c(0, 
    100)) + scale_y_continuous(breaks = seq(0, 100, by = 10))
print(newplot)  #scale continuous is seq(from,to,by)

2.2.2 Set axis limits.

‘xlim’ & ‘ylim’ can cut off data. Tread lightly. Here is what xlim = c(0, 50) looks like.

smaller <- baseplot + xlim(c(0, 50))
print(smaller)

2.2.3 Zoom in on a specific plot range.

coord_cartesian() zooms in on a specific range without cutting/eliminating data.

focused <- baseplot + coord_cartesian(xlim = c(0, 25), ylim = c(0, 50))
print(focused)



2.3 Graphical parameters (par, mar) & Multiplot Grids

Let’s say we want all four scatterplots we’ve produced so far on one plot. There are many ways of doing this; facet_wrap and facet_grid come to mind. However, here’s another way that’s useful if the plots are of different types.

# TBA
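Until the TBA arrives, here’s a minimal base-R sketch using par(mfrow), since this section is about graphical parameters anyway. This is my stand-in, not necessarily the method intended above; note it only tiles base-graphics plots, so ggplot objects need gridExtra::grid.arrange() or the patchwork package instead.

```r
# Tile four base-R scatterplots in a 2x2 grid via par(mfrow).
# NB: this works for base graphics only; ggplot objects need
# gridExtra::grid.arrange() or patchwork instead.
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))  # 2x2 grid, slimmer margins
set.seed(1)
for (i in 1:4) plot(sample(0:100, 100), sample(0:100, 100), main = paste("panel", i))
par(op)  # restore the default graphical parameters
```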



2.4 Annotation

Generally I cheat a bit at this point, porting figures over to Adobe Illustrator, embedding them as PDFs, and editing freehand from there. But here are some examples of some pretty useful annotation.

# TBA
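In the meantime, a minimal sketch of ggplot2’s annotate() — a text label, an arrow, and a reference line. The mtcars data and the specific coordinates are just placeholders of mine, not the examples planned above.

```r
library(ggplot2)
# Three common annotation moves: text, an arrow segment, and a dashed line.
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point() +
    annotate("text", x = 4.5, y = 30, label = "heavy cars") +
    annotate("segment", x = 4.5, y = 28.5, xend = 4.8, yend = 16,
             arrow = arrow(length = unit(2, "mm"))) +  # arrow() is re-exported from grid
    geom_vline(xintercept = 3, linetype = "dashed")
print(p)
```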




3 Bar plots & boxplots

3.1 Boxplot

Boxplots. It’s easy to forget that the crossbar does not reflect a sample mean but instead reflects the median. The components of the box are Q1, Q2 (median), and Q3. The whiskers extend to the min and max (excluding outliers). Let’s construct a boxplot from scratch from a bogus sample composed of 5 observations (e.g., kids’ shoe sizes): 3, 4, 5, 6, 7.

seq7 <- data.frame(seq(3, 7, 1))  #sequence from 3-7, by 1's
names(seq7)[1] <- paste("shoe.size")
summary(seq7)
##    shoe.size
##  Min.   :3  
##  1st Qu.:4  
##  Median :5  
##  Mean   :5  
##  3rd Qu.:6  
##  Max.   :7

Stat 101: The box is defined by the interquartile range (Q1 to Q3). The crossbar is the median (here it’s 5). The whiskers represent the max and min (that are not outliers). The trick to getting the whiskers here is to generate the error bar and then plop the boxplot over it. The widths of the box and the whiskers should match.

box <- ggplot(seq7, aes(x = "", y = shoe.size)) + stat_boxplot(geom = "errorbar", 
    width = 0.3) + xlab(NULL) + scale_y_continuous(breaks = seq(0, 10, 2), limits = c(0, 
    10)) + geom_boxplot(fill = "blue", width = 0.3) + jamie.theme
print(box)

Oh no. You forgot to grab shoe sizes for two other people. One is a 6 and the other is 26. Add them to the dataframe and re-run the boxplot. That lone point is an outlier, as defined by > Q3 + 1.5*(IQR).

seq9 <- rbind(seq7, 6, 26)
seq9
##   shoe.size
## 1         3
## 2         4
## 3         5
## 4         6
## 5         7
## 6         6
## 7        26
outlier <- ggplot(seq9, aes(x = "", y = shoe.size)) + stat_boxplot(geom = "errorbar", 
    width = 0.3) + geom_boxplot(fill = "blue", outlier.stroke = 1, width = 0.3) + 
    xlab(NULL) + scale_y_continuous(breaks = seq(0, 30, 5), limits = c(0, 30)) + 
    jamie.theme
# important part here, X='' means there is no x-variable to map. It's like
# nulling that out.
print(outlier)
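The outlier rule from above can be checked by hand. This base-R snippet recomputes the fence for the same seven shoe sizes:

```r
sizes <- c(3, 4, 5, 6, 7, 6, 26)                 # same data as seq9
q <- quantile(sizes, c(0.25, 0.75))              # Q1 = 4.5, Q3 = 6.5 (default type 7)
upper.fence <- unname(q[2]) + 1.5 * IQR(sizes)   # Q3 + 1.5*IQR = 9.5
sizes[sizes > upper.fence]                       # only 26 crosses the fence
```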




4 Correlograms & Correlation Plots

4.1 Correlogram

This one goes out to Dr. Joshua Troche, Assistant Professor at the University of Central Florida. Dr. Troche crowdsourced ratings from 300+ people on 16 semantic dimensions for 750 English words. The correlation matrices and plots to follow reflect the strength of the relationship(s) between factors like sound salience, visual form, loudness. For example, sound and visual form tend to be highly correlated when considering words like ‘dog’. Josh published this work in Frontiers in Human Neuroscience.

Coerce the first column to rownames or else you will get a big fat error when cor() tries to correlate a character column with numeric ones.

MDS_Corr <- read.csv("JoshCorr.csv", header = T, row.names = 1)
tib <- as_tibble(MDS_Corr)
print(tib, n = 5)
## # A tibble: 751 x 16
##   Emotion Polarity Social Moral MotionSelf Thought Color TasteSmell
## *   <dbl>    <dbl>  <dbl> <dbl>      <dbl>   <dbl> <dbl>      <dbl>
## 1    4.52     4.85   4.24  4.65        5.1    5.19   1.9       1.64
## 2    3.66     4.03   3.3   3.29        2.7    3.53   2.4       1.64
## 3    3.09     3.6    2.62  2.62        3.5    3.15   2.6       1.64
## 4    5.41     5.52   3.97  4.74        4.2    4.72   1.9       1.64
## 5    2.9      3.33   3.3   2.84        3.2    3.53   2         1.64
## # ... with 746 more rows, and 8 more variables: Tactile <dbl>,
## #   VisForm <dbl>, Auditory <dbl>, Space <dbl>, Quantity <dbl>,
## #   Time <dbl>, Concreteness <int>, Imageability <int>
joshcor <- cor(MDS_Corr, use = "complete.obs")  #converts the dataframe to a correlation matrix
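Since JoshCorr.csv isn’t public, the row.names = 1 trick can be demonstrated with a made-up inline CSV (the three words and ratings below are invented for illustration):

```r
# Toy demo of why row.names = 1 matters when the first column is text.
csv <- "word,Emotion,Color\ndog,4.5,1.9\ncat,3.7,2.4\nrock,1.2,3.1"
bad <- read.csv(text = csv)                  # first column stays character;
                                             # cor(bad) would throw "'x' must be numeric"
good <- read.csv(text = csv, row.names = 1)  # coerce it to rownames instead
cor(good)                                    # now a clean 2 x 2 correlation matrix
```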

Here it is printed as a full Pearson correlation matrix.

corrplot(joshcor, method = "number", diag = F, type = "upper", bg = "gray54", 
    cl.pos = "n", number.cex = 0.9, col = "black", outline = T, number.digits = 2, 
    addgrid.co = "black")

Generate the corrplot.

corrplot(joshcor, method = "shade", shade.col = NA, tl.col = "black", tl.srt = 45)



4.2 Corrplot varied by color and size

Here’s a correlation dotplot where the size of the circles indicates the strength of the correlation and their color indicates its direction (blue is negative, red is positive). Since this is a symmetrical matrix, I only included the upper triangle.

corr <- read.csv("Exp1_Corrplot.csv", header = T)
corrmat <- cor(corr, use = "complete.obs")  #create a correlation matrix
colnew <- colorRampPalette(c("blue", "white", "red"))(10)  #creates a 10-step color palette ranging from blue through white to red
# generate a correlogram, upper only, by circle, color range by custom
# palette
corrplot(corrmat, method = "circle", type = "upper", tl.srt = 45, tl.cex = 1, 
    tl.col = "black", col = colnew, diag = F, order = "hclust")




5 Dendrograms

5.1 Steps & considerations

Nobody loves dendrograms more than my recently graduated MA thesis student, Helen Felker, and here’s one of her favorites. This ‘base’ unadorned cluster dendrogram represents semantic relations between words (N=75) rated in Spanish by a Spanish-English bilingual speaker on color, sound, emotion, size, and time salience. Here’s the setup.

raw <- read.csv("Biling_Master_v10.csv", header = T)
raw_2 <- raw %>% select(8:9, 22:27)  #isolate only the columns of interest: dimensions, condition, participant
sp <- raw_2 %>% filter(Tag == "sp") %>% select(2:8)  #filters rows only corresponding to one condition 'sp' - spanish bilinguals
sp_means <- sp %>% group_by(Stim) %>% summarise_all(funs(mean(., na.rm = TRUE))) %>% 
    as.data.frame()
head(sp_means, n = 8)
##          Stim Color Sound Morality Valence Size Position
## 1   advantage  1.65  1.60     2.70    3.10 2.80     2.20
## 2      advice  1.35  1.75     4.85    4.45 2.30     1.10
## 3 appointment  1.55  1.65     2.65    2.90 2.35     2.80
## 4 arrangement  2.00  1.55     2.90    2.55 1.40     1.30
## 5      beauty  3.10  2.00     4.55    4.75 3.40     1.70
## 6         bed  3.50  2.65     1.20    3.60 3.10     3.15
## 7    behavior  1.65  1.95     5.80    4.65 2.45     1.55
## 8        bird  4.90  5.20     1.35    2.05 2.90     2.20

Getting there. Now we must coerce the first column to rownames. Rownames get plotted as the dendrogram leaves.

sp2 <- sp_means[, -1]  #create a new dataframe minus the first column (eliminate what will be rownames)
rownames(sp2) <- sp_means[, 1]  #coerce the first column (Word) into rownames of the dataframe
head(sp2, n = 8)
##             Color Sound Morality Valence Size Position
## advantage    1.65  1.60     2.70    3.10 2.80     2.20
## advice       1.35  1.75     4.85    4.45 2.30     1.10
## appointment  1.55  1.65     2.65    2.90 2.35     2.80
## arrangement  2.00  1.55     2.90    2.55 1.40     1.30
## beauty       3.10  2.00     4.55    4.75 3.40     1.70
## bed          3.50  2.65     1.20    3.60 3.10     3.15
## behavior     1.65  1.95     5.80    4.65 2.45     1.55
## bird         4.90  5.20     1.35    2.05 2.90     2.20

First convert the original data to a distance matrix (Euclidean), then hclust it using Ward’s Method. Then ‘as.dendrogram’ it. Then plot it. Here I struggled a bit with getting the branches of the dendrogram relatively homogeneous in length. The par(mar) setting controls the plot margins with four values: bottom, left, top, and right. The default is c(5.1, 4.1, 4.1, 2.1).

sp.dist <- dist(sp2)  #converts the dataframe to a Euclidean distance matrix
sp.clust <- hclust(sp.dist, method = "ward.D")  #hierarchical cluster analysis, Ward's method
sp.dend <- as.dendrogram(sp.clust)
# plt.dend <- ggdendrogram(sp.dend, rotate=TRUE)

plt.v1 <- sp.dend %>% set("labels_cex", 0.8) %>% set("branches_lwd", 1.5) %>% 
    set("leaves_cex", 1.5) %>% set("labels_col", "blue") %>% raise.dendrogram(2)
# par(mar=c(1, 5, 1, 7)) #changes graphical parameters
plot(plt.v1, horiz = T)

# dev.off() #resets to default graphical parameters
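Since the Biling_Master CSV isn’t available, the same dist → hclust → as.dendrogram pipeline can be run on the built-in USArrests data (my substitution, just to make the steps reproducible):

```r
# Same pipeline on built-in data: USArrests stands in for the ratings file.
d    <- dist(USArrests[1:10, ])        # Euclidean distance matrix for 10 states
hc   <- hclust(d, method = "ward.D")   # hierarchical clustering, Ward's method
dend <- as.dendrogram(hc)
attr(dend, "members")                  # 10 leaves, one per rowname
plot(dend, horiz = TRUE)               # rownames become the leaf labels
```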

Here’s the same dendrogram plotted as a triangle. There’s lots of extra ‘junk’ space at the top. If we focus on the lower leaves, we might be able to discern a bit more structure.

plt.tri <- sp.dend %>% set("labels_cex", 0.8) %>% set("branches_lwd", 1.5) %>% 
    set("leaves_cex", 1.5) %>% set("labels_col", "blue") %>% raise.dendrogram(2)
plot(plt.tri, horiz = F, type = "triangle")

Zoom in on the previous dendrogram, cut at a height of 20. Show small leaf & branch nodes.

plt.tri2 <- sp.dend %>% set("labels_cex", 0.8) %>% set("branches_lwd", 1.25) %>% 
    set("leaves_cex", 1.25) %>% set("labels_col", "blue") %>% set("nodes_pch", 
    19) %>% set("nodes_cex", 0.5)
# plot(plt.tri2, horiz=F, type='triangle')
plot(cut(plt.tri2, 20)$lower[[2]], horiz = F, type = "triangle")



5.2 Dendrogram to Undirected Graph

Let’s say for kicks and giggles we want to examine these relations as a network rather than as a dendrogram. We’ll use the ggraph package that can take a dendrogram and convert it to an igraph object, then label the nodes. This network looks pretty decent in terms of collocating words that are semantically related near each other. I would just feel weird about publishing a network figure like this because the math that is going on under the hood is like witchcraft to me. Regular hierarchical cluster analysis makes sense, but graph theory doesn’t quite gel.

sp.graph <- den_to_igraph(sp.dend, even = FALSE)
v1 <- ggraph(sp.graph) + geom_edge_link() + geom_node_point(color = "red") + 
    geom_node_text(aes(label = label), repel = T) + jamie.theme
print(v1)



6 Heatmaps

6.1 Custom color palette, no dendrogram or variable ordering

Oh the beauty of a nice heatmap. Here’s one using our semantic data where I’ve played with the color spectrum, varying it from yellow (low) to red (high) in 299 hue increments. These data represent average Likert-scale ratings on multiple semantic dimensions (color, sound, etc.) for three English words (kite, alligator, reputation).
Prep the dataframe by changing it to a matrix, coercing column one to rownames.

dat.t <- read_excel("Tiles2.xlsx")
dat.d <- as.data.frame(dat.t)
dat.1 <- dat.d[, -1]  #create a new dataframe minus the first column (eliminate what will be rownames)
rownames(dat.1) <- dat.d[, 1]
mat.1 <- as.matrix(dat.1)
print(mat.1)
##            Color Taste Tactile VisForm Auditory Space Quantity Emotion
## alligator    4.9  3.55     4.2     6.0      4.0  2.38     2.58    1.93
## kite         4.0  2.64     5.1     6.5      3.4  3.19     2.33    2.24
## reputation   1.5  1.64     2.2     1.9      1.5  3.25     4.42    4.17
##            Polarity Social Moral Motion
## alligator      3.12   1.55  1.58    1.2
## kite           3.15   2.12  1.52    1.9
## reputation     4.91   4.76  4.32    3.4

Now work the magic by specifying a color vector. heatmap.2 is finicky with the margins and doesn’t respect par() settings; use its margins argument instead.

jamie.colors <- colorRampPalette(c("yellow", "red"))(n = 299)
heatmap.2(mat.1, key.title = NA, dendrogram = "none", trace = "none", col = jamie.colors, 
    density.info = "none", margins = c(8, 8), cexRow = 1.5)
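One thing worth spelling out: colorRampPalette() doesn’t return colors, it returns a palette function; calling that function with n gives n hex colors interpolated between the endpoints.

```r
# colorRampPalette() returns a function; call it with n for n hex colors.
jamie.colors <- colorRampPalette(c("yellow", "red"))
jamie.colors(3)  # the two endpoints plus one interpolated midpoint
```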



7 Histograms & Density Plots

7.1 Change bin size & scale

Here’s something you don’t see every day. This histogram reflects counts from Likert-Scale ratings of the quality of pairing common English nouns (e.g., rocket) with taboo words (e.g., shit) to form new taboo compounds (e.g., shitrocket). In this particular histogram, I changed the binwidth to .10 and the y-axis to raw counts.

profexp2 <- read.csv("ProfanityExp2_CleanR.csv", header = T)
histprof <- ggplot(profexp2, aes(x = Qualtrics)) + geom_histogram(aes(y = ..count..), 
    colour = "black", fill = "goldenrod2", binwidth = 0.1) + xlab("Average Likert Scale Rating") + 
    ylab("Word Count Per Bin") + jamie.theme + ggtitle("profane noun compounds")
histprof



7.2 Density plots

A density plot is a smoothed histogram. They’re quite pretty. Let’s say you want to examine two distributions, each composed of 1000 observations. It’s Philly kids vs NYC kids on some crazy standardized test. Let’s say the Philly kids score a mean of 80 (sd=2), whereas NYC kids score a mean of 87 (sd=5). Here’s our fake data setup.

set.seed(123)
philly <- rnorm(1000, 80, 2)
set.seed(1234)
nyc <- rnorm(1000, 87, 5)
all <- data.frame(cbind(philly, nyc))
phlnyc <- all %>% melt(measure.vars = 1:2, variable.name = "city", value.name = "test.score")  #Melt that dataframe into long form to prep for faceting.
head(all)
##     philly      nyc
## 1 78.87905 80.96467
## 2 79.53965 88.38715
## 3 83.11742 92.42221
## 4 80.14102 75.27151
## 5 80.25858 89.14562
## 6 83.43013 89.53028

Let’s check the descriptives.

describeBy(phlnyc, group = phlnyc$city)
## 
##  Descriptive statistics by group 
## group: philly
##            vars    n  mean   sd median trimmed  mad   min   max range skew
## city*         1 1000  1.00 0.00   1.00    1.00 0.00  1.00  1.00   0.0  NaN
## test.score    2 1000 80.03 1.98  80.02   80.02 1.93 74.38 86.48  12.1 0.07
##            kurtosis   se
## city*           NaN 0.00
## test.score    -0.08 0.06
## -------------------------------------------------------- 
## group: nyc
##            vars    n  mean   sd median trimmed  mad   min    max range
## city*         1 1000  2.00 0.00    2.0    2.00 0.00  2.00   2.00  0.00
## test.score    2 1000 86.87 4.99   86.8   86.83 4.76 70.02 102.98 32.96
##             skew kurtosis   se
## city*        NaN      NaN 0.00
## test.score -0.01     0.24 0.16

Here’s what the two distributions look like facetted as histograms.

col.vec <- c("goldenrod2", "blue")
histphlnyc <- ggplot(phlnyc, aes(x = test.score, fill = city)) + geom_histogram(aes(y = ..count..), 
    colour = "black", binwidth = 0.5) + scale_fill_manual(values = col.vec) + 
    xlab("Test.Score") + ylab("Count") + facet_wrap(~city) + theme_bw()
print(histphlnyc)

Hmm… but we want to eyeball the two distributions in the same ‘space’. Enter density plot.

dense <- ggplot(phlnyc, aes(x = test.score, fill = city)) + geom_density(colour = "black") + 
    scale_fill_manual(values = col.vec) + xlab("Test.Score") + ylab("Density") + 
    jamie.theme
print(dense)

Gross. Let’s play with the aesthetics, rid ourselves of the black outlines and adjust the alpha. This one looks better.

adj.alpha <- ggplot(phlnyc, aes(x = test.score, fill = city)) + geom_density(alpha = 0.5, 
    colour = "white") + scale_fill_manual(values = col.vec) + xlab("Test.Score") + 
    ylab("Density") + jamie.theme
print(adj.alpha)




8 Line graphs & time series plots



8.1 Smoothing functions & error bars

These curves reflect three response functions of baseline corrected change in pupil size (in mm) elicited during a word monitoring task when people stare at either an unchanging gray screen in darkness, mid-level luminance, or obnoxiously bright light. The data are in long form, factors are reordered (dark, mid, bright).

summary.words <- read.csv("Exp2_RSetupPostPipeline_1.0.csv", header = T)
summary.words$lightCond <- factor(summary.words$lightCond, levels = c("dark", 
    "mid", "bright"))
newvec <- c("black", "red", "yellow")
pupils <- ggplot(summary.words, aes(x = bin, y = raw.diff, color = lightCond, 
    fill = lightCond)) + geom_smooth(method = "loess", alpha = 0.4) + scale_colour_manual(values = newvec) + 
    scale_fill_manual(values = newvec) + coord_cartesian(ylim = c(-0.05, 0.2)) + 
    ylab("Absolute Pupil Dilation (mm)") + xlab("Event Duration (bin)") + 
    scale_x_continuous(breaks = pretty(summary.words$bin, n = 12)) + jamie.theme
print(pupils)

8.1.1 Facet-wrapped line graph w/ Loess smoother

This is a pretty little multiplot of a study now underway where Spanish-English speakers rate emotional salience, color, etc. for the same set of words (translation equivalents) once in Spanish and a week later again in English. First steps are to get the data read in and into a form we can pass to ggplot2.

raw <- read.csv("Biling_Master_v10.csv", header = T)
raw_cond <- raw %>% select(3, 13, 22:27)  #select the range of columns we need
cond_melt <- raw_cond %>% melt(id.vars = c(1, 2))  #melt into long form
head(cond_melt, n = 5)
##     Condition CNC.av variable value
## 1 Monolingual  2.915    Color     1
## 2 Monolingual  2.730    Color     1
## 3 Monolingual  3.000    Color     1
## 4 Monolingual  2.300    Color     1
## 5 Monolingual  3.005    Color     7

Plot it.

colslope <- c("green", "gray")  #creates a vector of colors to pass to aes
ggplot(cond_melt, aes(x = CNC.av, y = value, fill = Condition, color = Condition)) + 
    geom_smooth(method = "loess") + scale_fill_manual(values = colslope) + scale_color_manual(values = colslope) + 
    facet_wrap(~variable, ncol = 2) + xlab("concreteness") + ylab("likert rating") + 
    jamie.theme

8.1.2 Standard error bars (stat_summary)

Here’s a line graph representing group level pupillary responses when people hear taboo words (e.g., dick) relative to matched but arousing anatomical terms (e.g., penis) and neutral terms (e.g., arm).

pupcurse <- read.csv("exp1_pupilgood_noNW.csv", header = T)
# melts the original wide dataframe using columns 1-5 as measured vars and
# names the bin variable
pupmelt <- melt(pupcurse, id.vars = c("Subject", "Word", "WordID", "Condition", 
    "Lexicality"), variable.name = "time_bin")
# changes time_bin variable from factor to numeric
pupmelt$time_bin <- as.numeric(pupmelt$time_bin)
# Plots word-type line graph of the raw time series: neutral, profane, technical
f3raw <- ggplot(pupmelt, aes(x = time_bin, y = value, color = Condition)) + 
    stat_summary(fun.y = mean, geom = "line", size = 0.5) + scale_colour_manual(values = c("green", 
    "black", "red")) + stat_summary(fun.data = mean_se, geom = "pointrange", 
    size = 0.2) + theme_bw() + ylab("Pupil Diameter Change (mm)") + xlab("Time Bin") + 
    theme(legend.position = c(0.85, 0.85)) + jamie.theme
print(f3raw)

8.1.3 Pointrange Reflecting Error

These are some interesting data from my former MA student Victoria Diedrichs’ thesis. These pupil response functions reflect Yes/No responses when people think of darkness (dark cave) associated with “No” and brightness (sunny day) associated with “Yes”.

pup.yes <- read_csv("BrightDark.csv")
pup.melt <- melt(pup.yes, id = 1:3)
pup.melt$Time <- 125 * (as.numeric(pup.melt$variable) - 1)
mycolors <- c("goldenrod2", "black")
ggplot(subset(pup.melt, Time > 0), aes(Time, value, color = COND)) + stat_summary(fun.data = mean_se, 
    geom = "pointrange", size = 0.3) + scale_color_manual(values = mycolors) + 
    jamie.theme



8.2 Time Series: A Sad Saga of Bad Programming

This was an attempt to overlay two time series reflecting 30 seconds of pupillary measurement (sampling rate 120 Hz) for two of my labbies (Jill and Liz). There are 3450 samples for Liz and Jill. The time series are non-stationary, and there are missing observations associated with blinks. Let’s look at the raw data.

pup <- read.csv("pup_raw.csv", header = T)
head(pup, n = 5)
##    liz jill
## 1 4.45 5.10
## 2 4.45 5.08
## 3 4.45 5.07
## 4 4.45 5.07
## 5 4.45 5.06

Melt the dataframe into long form.

pup.m <- pup %>% melt(measure.vars = 1:2, value.name = "pupil.size", variable.name = "labbie")
head(pup.m, n = 4)  #hmmm.... need to add a counter variable to plot along x-axis
##   labbie pupil.size
## 1    liz       4.45
## 2    liz       4.45
## 3    liz       4.45
## 4    liz       4.45

I’m stuck. I have a long vector of pupil diameters (y-values), but there are no corresponding x-axis values. I need to create a sequence of numbers 1-length(x) by each level of the factor. I looked at seq_along and a few other options and realized I had no idea how to do this. Deep breath. If I can get something to work on a teeny dataframe, it’ll scale up to this big time series, right?

vec1 <- c(rep("a", 5), rep("b", 5))
vec2 <- c(6, 9, NA, 1, 3, NA, 8, 2, 10, NA)
dat <- data.frame(vec1, vec2)  #NB: data.frame(cbind(...)) would coerce everything to character and scramble the values on conversion back to numeric
dat
##    vec1 vec2
## 1     a    6
## 2     a    9
## 3     a   NA
## 4     a    1
## 5     a    3
## 6     b   NA
## 7     b    8
## 8     b    2
## 9     b   10
## 10    b   NA

Success! We have a fake dataframe. Now we need a sequential ID variable that starts with Jill (1 to the end) and then restarts with Liz (1 to 3450). Here’s my first try using seq_along.

dat$try <- seq_along(dat$vec1)
dat
##    vec1 vec2 try
## 1     a    6   1
## 2     a    9   2
## 3     a   NA   3
## 4     a    1   4
## 5     a    3   5
## 6     b   NA   6
## 7     b    8   7
## 8     b    2   8
## 9     b   10   9
## 10    b   NA  10

Burn. It didn’t work. ‘try’ is just a sequence of 1-10 along the whole range, ignoring the separate levels of the factor. After some Googling, I tried these different methods. Nothing worked. I almost resorted to writing a loop.

dat$maybe <- with(dat, ave(vec2, FUN = seq_along))
dat$perhaps <- data.table::rleid(dat$vec1)
dat$whoknows <- sequence(rle(dat$vec1)$lengths)

Then I recalled that REM song about ‘holding on’ and eventually discovered the ‘getanID’ function from the splitstackshape package. Look what it does: behold the .id column, exactly what we need.

print(dat.n <- getanID(data = dat, id.vars = "vec1"))
##     vec1 vec2 try .id
##  1:    a    6   1   1
##  2:    a    9   2   2
##  3:    a   NA   3   3
##  4:    a    1   4   4
##  5:    a    3   5   5
##  6:    b   NA   6   1
##  7:    b    8   7   2
##  8:    b    2   8   3
##  9:    b   10   9   4
## 10:    b   NA  10   5

Now that I know how that works, we can apply that function to our much longer time series to create a sampling variable.

pup.numbered <- getanID(data = pup.m, id.vars = "labbie")  #creates a sequence of 1:Length(X) per level of grouping variable
# data.frame(pup.numbered) #changes the data table to a dataframe, probably
# not necessary
setnames(pup.numbered, ".id", "time")  #renames .id to 'time'
head(pup.numbered, n = 8)
##    labbie pupil.size time
## 1:    liz       4.45    1
## 2:    liz       4.45    2
## 3:    liz       4.45    3
## 4:    liz       4.45    4
## 5:    liz       4.45    5
## 6:    liz       4.45    6
## 7:    liz       4.44    7
## 8:    liz       4.45    8
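For the record, the same per-group counter can be built in base R: ave() does exactly this once you pass the grouping variable, which was the missing piece in the failed attempt earlier. (dplyr’s group_by() followed by mutate(row_number()) is another route.)

```r
# Per-group counter without splitstackshape: ave() with a grouping variable.
vec1 <- c(rep("a", 5), rep("b", 5))
id <- ave(seq_along(vec1), vec1, FUN = seq_along)  # restarts at each level
id
# 1 2 3 4 5 1 2 3 4 5
```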

Mission accomplished? Now let’s plot it. Sure, it’s a bit ugly, as most complex time series are.

mycols <- c("goldenrod2", "blue")
ggplot(pup.numbered, aes(x = time, y = pupil.size, color = labbie)) + geom_line() + 
    scale_color_manual(values = mycols) + scale_x_continuous(breaks = seq(0, 
    max(pup.numbered$time), by = 500), limits = c(0, 3500)) + ylab("pupil diameter") + 
    xlab("time/samples") + jamie.theme

8.2.1 After all that…plotting a time series with base R

After all that craziness creating a fake time variable to plot that pupil time series, I discovered that base R does it for free without any of that madness. Here’s a continuous recording of pupil diameter over 3000 samples (a relatively stationary time series).

pupil <- read.csv("PupilRise.csv", header = T)  # reads in file
head(pupil)
##   size
## 1 4.45
## 2 4.45
## 3 4.45
## 4 4.45
## 5 4.45
## 6 4.45
pupil.ts <- as.ts(pupil)  #recodes the first column into a time series of pupil diameter fluctuations
head(pupil.ts, n = 5)
## Time Series:
## Start = 1 
## End = 5 
## Frequency = 1 
##      size
## [1,] 4.45
## [2,] 4.45
## [3,] 4.45
## [4,] 4.45
## [5,] 4.45

Let’s first make sure that as.ts() changed the dataframe to a time series.

str(pupil.ts)
##  Time-Series [1:3596, 1] from 1 to 3596: 4.45 4.45 4.45 4.45 4.45 4.45 4.44 4.45 4.45 4.44 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "size"

Ach ja. Here’s the base R plotting function raw and unadorned.

# Creates a time series plot of the continuously sampled pupil diameter
# measurements, scales the y-axis from 3 to 5 mm, colors the line red,
# relabels the axes
plot.ts(pupil.ts, ylim = c(3, 5), xlim = c(0, 3000), ylab = "pupil dm (mm)", 
    xlab = "samples/time", col = "red")



8.3 Smoothing

Oh really, reviewer 3? You would like to see a smoother time series? Oh, of course you would. Let’s apply a simple moving average to the original time series and then replot it. TTR’s simple moving average (SMA) function averages each datapoint with the n-1 points before it to create a new smoothed time series. This takes care of “weird” artifacts like your eyetracker behaving crazily. Here’s what smoothing with a backward window of 10 samples looks like: less jagged than the original.

ts.smooth <- ts(SMA(pupil$size, n = 10))
plot.ts(ts.smooth, ylim = c(3, 5), xlim = c(0, 3000), ylab = "pupil dm (mm)", 
    xlab = "samples/time", col = "blue", lwid = 3)

Smoothing at a 20-sample window. Now that’s just madness.

ts.smooth <- ts(SMA(pupil$size, n = 20))
plot.ts(ts.smooth, ylim = c(3, 5), xlim = c(0, 3000), ylab = "pupil dm (mm)", 
    xlab = "samples/time", col = "purple", lwid = 3)

8.3.1 Generic but instructive

Just means across time. I eliminated the x-axis line and redrew it as dashed.

lvppa <- read.csv("ancdsDat.csv", header = T)
ggplot(lvppa, aes(x = Month, y = Accuracy, colour = Item)) + 
    geom_line(size = 1) + theme_bw() + theme(axis.line = element_line(colour = "black"), 
    panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
    panel.border = element_blank(), panel.background = element_blank(), 
    axis.line.x = element_blank()) + coord_cartesian(ylim = c(0, 1.1)) + 
    geom_hline(aes(yintercept = 0), linetype = "dashed") + ylab("% Accuracy") + 
    xlab("Time (months)") + geom_point(shape = 17, size = 3) + labs(colour = "")




9 Scatterplots



9.1 Barplot & boxplot alternative

Here’s a nice alternative to boxplots, reflecting individual differences in average uncorrected pupil diameters for a bunch of neurotypical adults measured in very bright, medium, and dark ambient light. Conditions are colored by a custom vector (in code block below as ‘newvec’). Here are the first 5 rows of the molten dataframe.

scat.raw <- read.csv("Exp2_PlotUncorrected.csv", header = T)
scat.raw.melt <- scat.raw %>% melt(measure.vars = 2:4)
head(scat.raw.melt, n = 5)
##   participant variable    value
## 1           2      low 3.861133
## 2           3      low 3.713745
## 3           4      low 4.012212
## 4           5      low 4.004413
## 5           6      low 4.000240

Plot it. The tricky part is getting R to plot the crossbars as means.

newvec <- c("black", "red", "yellow")
ggplot(scat.raw.melt, aes(variable, value, shape = variable, fill = variable)) + 
    geom_point(shape = 21, color = "black", size = 2.3, alpha = 0.6, position = position_jitter(w = 0.03, 
        h = 0)) + scale_fill_manual(values = newvec) + ylab("Raw Pupil Size (mm)") + 
    ylim(c(2, 5)) + stat_summary(fun.y = mean, fun.ymin = mean, fun.ymax = mean, 
    geom = "crossbar", color = "black", size = 0.4, width = 0.3) + jamie.theme



9.2 Applying labels to points

Here’s the worst scatterplot you’ll ever see. This one is for a project we are running on physiological responses to profanity using pupillometry. I used geom_text_repel() from ggrepel to nudge the labels so that there is no overlap. There will be jail time.

profane <- read.csv("profane.csv", header = T)
nat <- ggplot(profane, aes(Social_Accept, Arousal, color = Condition)) + geom_point(size = 3) + 
    ylab("Physiological Arousal") + xlab("Social Acceptibility") + scale_color_manual(values = c("black", 
    "green")) + geom_text_repel(aes(label = Word), color = "black", size = 3) + 
    theme(legend.position = c(0.85, 0.85)) + theme(legend.background = element_rect(fill = "white", 
    color = "black")) + theme(axis.line = element_line(colour = "black"), panel.border = element_blank(), 
    panel.background = element_blank()) + theme(panel.grid.minor = element_line(colour = "gray", 
    size = 0.1))
nat



9.3 Trendline examples

This scatterplot shows the correlation between concreteness and imageability across many nouns from the MRC Psycholinguistic Database. For some reason CNC gets read in as a factor and must be coerced back to its numeric values for this to work. The smoother here uses a loess function. Since N > 1000, this is probably ok.

cnc <- read.csv("ImgBigSet.csv", header = T)
cnc$CNC <- as.integer(as.character(cnc$CNC))  #factor -> character -> integer preserves the printed values
ggplot(cnc, aes(x = CNC, y = IMG)) + geom_point(shape = 2, size = 1, color = "Blue", 
    alpha = 0.7) + stat_smooth(method = loess, color = "red") + xlab("Concreteness") + 
    ylab("Imageability") + scale_x_continuous(breaks = seq(0, 600, by = 100)) + 
    scale_y_continuous(breaks = seq(0, 600, by = 100)) + jamie.theme
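A caution on that factor coercion: as.integer() applied directly to a factor returns the internal level codes, not the printed values, so the as.character() detour is essential. A toy demo (the values are made up):

```r
# as.integer() on a factor returns level codes, not the original numbers.
f <- factor(c("3", "10", "2"))
as.integer(f)                # 3 1 2  -- alphabetical level codes, scrambled
as.integer(as.character(f))  # 3 10 2 -- the actual values
```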



9.4 Facet-wrapping examples

This plot also has some nice jitter, colors, and opacity on the points. These data represent different measures of pupil size. Some steps in structuring the data….

scat2 <- read.csv("Exp2_Scattersetup_2.0.csv", header = T)
scat.m <- scat2 %>% melt(id.vars = 5, measure.vars = 2:4)  #melt the original dataframe to long form
scat.m$variable <- factor(scat.m$variable, levels = c("low", "mid", "high"))  #reorder the factors 
head(scat.m, n = 5)
##   condition variable      value
## 1      Peak     high 0.39775642
## 2      Peak     high 0.26944261
## 3      Peak     high 0.09137426
## 4      Peak     high 0.35118977
## 5      Peak     high 0.09760123

Now plot it, faceting by condition. NB: the y-axes have different scales, so the facets use scales = "free_y".

newvec <- c("black", "red", "yellow")
scat.facet <- ggplot(scat.m, aes(variable, value, shape = variable, fill = variable)) + 
    geom_point(shape = 21, color = "black", size = 2.3, alpha = 0.6, position = position_jitter(w = 0.17, 
        h = 0)) + scale_fill_manual(values = newvec) + facet_wrap(~condition, 
    scales = "free_y") + ylab("Pupil Dilation (mm)")
scat.facet




10 Frankenplots: The Island of Lost Toys